njuLink: results for instance matching at OAEI 2017
Abstract
njuLink is a tool designed for instance matching. It matches instances mainly by finding discriminative property pairs. To satisfy the 1:1 equivalence relationship required by the OAEI 2017 DOREMUS task, we also make several improvements. In this report, we describe the design ideas and show our evaluation results.

1 Presentation of the System

1.1 State, purpose, general statement

With the rapid development of the Semantic Web, the amount of RDF data on it is growing at an unprecedented pace. This brings great challenges to instance matching. On the Semantic Web, an instance describes a real-world object. It is composed of a subject and many 〈p, v〉 pairs, where p denotes a “property” and v denotes a “value”. The subject serves as a unique token for the real-world object, and the 〈p, v〉 pairs describe the features of that object. Instance matching aims to find the instances that describe the same real-world object and to establish links between them. If two instances describe the same real-world object, we call them coreferent instances, or a coreferent instance pair.

Thanks to a lot of existing work, e.g., the Linked Open Data (LOD) Initiative, millions of links have been established. However, a huge number of instances that potentially refer to the same object have not been interlinked yet. Our previous work tries to find coreferent instances by discriminative properties [2]. This approach is very effective but needs some improvements to meet the requirements of the DOREMUS task, which is to find a 1:1 equivalence relationship between two datasets. We therefore design njuLink, where “nju” stands for “Nanjing University”.

The key idea of njuLink lies in finding what is essential to determine whether two instances are coreferent. Driven by this, njuLink first builds a small-scale training set by predicting coreferent and non-coreferent instance pairs.
Then, by analyzing the value similarity of every instance pair in the training set, njuLink finds property pairs, named discriminative property pairs, that are able to identify whether two instances are coreferent. Finally, for an instance pair, njuLink calculates three quantities to determine whether the instances in the pair are coreferent: the similarity of values over the discriminative property pairs, the similarity of values over the common property pairs, and the similarity of the properties that the two instances have.

1.2 Specific techniques used

There are four steps in the workflow of njuLink, which is shown in Fig. 1. We will describe the strategies to calculate the similarity of values and the similarity of properties shortly.

Fig. 1. The workflow of njuLink

The task we participated in is to find coreferent instance pairs between two datasets. To make our descriptions clearer, we use the following notation:

(1) Let $D^x$ and $D^y$ be two different datasets.
(2) Elements with superscript x are from $D^x$ and those with superscript y are from $D^y$, e.g., instances, properties and values in $D^x$ are $i^x$, $p^x$ and $v^x$, respectively.
(3) Every instance pair 〈$i^x$, $i^y$〉 mentioned in this article is composed of an instance $i^x$ from $D^x$ and an instance $i^y$ from $D^y$, with $i^x$ written on the left and $i^y$ on the right. The same applies to property pairs 〈$p^x$, $p^y$〉 and value pairs 〈$v^x$, $v^y$〉.

Preprocess Data. For an instance, njuLink preprocesses the values describing it. There are three types of values: blank node, URI and literal (plain or typed). If a value is a blank node, njuLink ignores it. Literals are divided into two kinds: typed literals, such as boolean and integer, and plain literals, which are often accompanied by a language tag. First, njuLink records the type of each value. Second, if the value has a language tag, njuLink records it as well. Third, for literals, njuLink uses an NLP tool to replace punctuation and stop words such as “at”, “in” and “for” with spaces, and then removes all spaces.
For URIs, njuLink records only the local name. Finally, njuLink transforms subjects, properties and values to lowercase and stores them for the next step.

Strategies to Calculate Similarity. We next describe our strategies to obtain the similarity of a value pair and the similarity of a property pair.

Calculate similarity of a value pair. Let $v^x$ and $v^y$ be two values owned by properties $p^x$ and $p^y$, respectively. First, njuLink judges whether it is meaningful to compare $v^x$ and $v^y$. Comparing them is not meaningful in three situations: (1) both have language tags and the tags are different; (2) their types are different; and (3) one of them is a blank node. Second, let $T(v)$ be the type of $v$. If it is meaningful to compare $v^x$ and $v^y$, the strategy to find their similarity, denoted by $\mathrm{ValSim}(v^x, v^y \mid p^x, p^y)$, varies with their type:

$$\mathrm{ValSim}(v^x, v^y \mid p^x, p^y) = \begin{cases} \mathrm{indicatorFunc}(v^x, v^y), & T(v^x) = \text{typed literal} \\ \text{I-Sub}(v^x, v^y), & \text{otherwise} \end{cases} \qquad (1)$$

For typed literals, njuLink uses an indicator function, $\mathrm{indicatorFunc}(v^x, v^y)$, to get their similarity; e.g., when two literals are both of date-time type, their similarity is 1 if the two literals are equal and 0 otherwise. For URIs and plain literals, njuLink uses I-Sub [3] to calculate the similarity. When the similarity of $v^x$ and $v^y$ is higher than a threshold, they are considered a similar value pair. The threshold is set to 0.65, which is suggested by the authors of I-Sub [3].

Calculate similarity of a property pair. Let $p^x$ and $p^y$ be two properties owned by instances $i^x$ and $i^y$, respectively. A property may have more than one value, so we let the sets of values of $p^x$ and $p^y$ be $\mathrm{Val}(p^x, i^x)$ and $\mathrm{Val}(p^y, i^y)$, respectively. First, we find the value set that has the smaller size. Without loss of generality, we assume that $\mathrm{Val}(p^x, i^x)$ is the smaller one. For a value $v^x$ in $\mathrm{Val}(p^x, i^x)$, the maximum similarity between it and the values in $\mathrm{Val}(p^y, i^y)$ is calculated by $\mathrm{MaxValSim}(v^x, \mathrm{Val}(p^y, i^y))$.
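The maximum similarity between the values of $\mathrm{Val}(p^x, i^x)$ and $\mathrm{Val}(p^y, i^y)$, which is also taken as the maximum similarity of the property pair 〈$p^x$, $p^y$〉, is denoted by $\mathrm{MaxPropSim}(p^x, p^y \mid i^x, i^y)$:

$$\mathrm{MaxValSim}(v^x, \mathrm{Val}(p^y, i^y)) = \max_{v^y_n \in \mathrm{Val}(p^y, i^y)} \mathrm{ValSim}(v^x, v^y_n \mid p^x, p^y), \qquad (2)$$

$$\mathrm{MaxPropSim}(p^x, p^y \mid i^x, i^y) = \max_{v^x_m \in \mathrm{Val}(p^x, i^x)} \mathrm{MaxValSim}(v^x_m, \mathrm{Val}(p^y, i^y)). \qquad (3)$$

If $\mathrm{MaxValSim}(v^x, \mathrm{Val}(p^y, i^y))$ is higher than a threshold (i.e., 0.65), the value $v^x$ is considered a matched value. We define the sets of matched and unmatched values between $p^x$ of $i^x$ and $p^y$ of $i^y$ as follows:

$$\mathrm{MatVal}(p^x, p^y \mid i^x, i^y) = \{\, v^x \mid v^x \in \mathrm{Val}(p^x, i^x) \wedge \mathrm{MaxValSim}(v^x, \mathrm{Val}(p^y, i^y)) > 0.65 \,\}, \qquad (4)$$

$$\mathrm{UnmatVal}(p^x, p^y \mid i^x, i^y) = \{\, v^x \mid v^x \in \mathrm{Val}(p^x, i^x) \wedge v^x \notin \mathrm{MatVal}(p^x, p^y \mid i^x, i^y) \,\}. \qquad (5)$$

If $\mathrm{MaxPropSim}(p^x, p^y \mid i^x, i^y)$ is higher than a threshold (0.65), the property pair 〈$p^x$, $p^y$〉 is similar w.r.t. the instance pair 〈$i^x$, $i^y$〉. Note that this property pair is not guaranteed to be similar in another instance pair. Over all matched values of $\mathrm{Val}(p^x, i^x)$, we sum up the similarities as $\mathrm{MatValSimSum}(p^x, p^y \mid i^x, i^y)$:

$$\mathrm{MatValSimSum}(p^x, p^y \mid i^x, i^y) = \sum_{v^x_m \in \mathrm{MatVal}(p^x, p^y \mid i^x, i^y)} \mathrm{MaxValSim}(v^x_m, \mathrm{Val}(p^y, i^y)).$$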
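The value-pair and property-pair similarities defined above can be illustrated with a minimal Python sketch. It is not the njuLink implementation. The `Value` class, the `kind` labels and all function names are hypothetical, `difflib.SequenceMatcher` is swapped in as a stand-in for I-Sub (which is not in the Python standard library), and returning 0.0 for pairs that are not meaningful to compare is an assumption; the text only says such pairs are not compared.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

THRESHOLD = 0.65  # threshold quoted in the text


@dataclass(frozen=True)
class Value:
    """Hypothetical container for a preprocessed RDF value."""
    text: str
    kind: str                  # assumed labels: "typed", "plain", "uri", "blank"
    lang: Optional[str] = None


def comparable(a: Value, b: Value) -> bool:
    """The three 'not meaningful to compare' situations from the text."""
    if a.kind == "blank" or b.kind == "blank":
        return False
    if a.kind != b.kind:
        return False
    if a.lang and b.lang and a.lang != b.lang:
        return False
    return True


def string_sim(a: str, b: str) -> float:
    # Stand-in for I-Sub: SequenceMatcher is NOT the metric used by njuLink.
    return SequenceMatcher(None, a, b).ratio()


def val_sim(a: Value, b: Value) -> float:
    """Eq. (1): indicator function for typed literals, string metric otherwise."""
    if not comparable(a, b):
        return 0.0  # assumption: treat non-comparable pairs as similarity 0
    if a.kind == "typed":
        return 1.0 if a.text == b.text else 0.0
    return string_sim(a.text, b.text)


def max_val_sim(v: Value, others: list) -> float:
    """Eq. (2): best similarity of v against the other value set."""
    return max((val_sim(v, w) for w in others), default=0.0)


def property_pair_stats(vals_x: list, vals_y: list) -> dict:
    """Eqs. (3)-(5) and the matched-similarity sum, taken over the smaller set."""
    small, large = (vals_x, vals_y) if len(vals_x) <= len(vals_y) else (vals_y, vals_x)
    scores = {v: max_val_sim(v, large) for v in small}
    matched = {v: s for v, s in scores.items() if s > THRESHOLD}
    return {
        "max_prop_sim": max(scores.values(), default=0.0),
        "matched": list(matched),
        "unmatched": [v for v in small if v not in matched],
        "mat_val_sim_sum": sum(matched.values()),
    }
```

A call such as `property_pair_stats([Value("mozart", "plain", "en")], [Value("mozart", "plain", "en"), Value("beethoven", "plain", "en")])` iterates over the smaller value set, as in the text. Swapping `string_sim` for a real I-Sub implementation would leave the rest of the sketch unchanged.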
Publication date: 2017